Abstract



Much of the current research concerning reliability is emphatically suggesting that 
researchers gather their own reliability estimates when administering an instrument. It has also 
been recommended that data with low reliability then be discarded. While some data obtained 
from instruments that originally yielded reliable results may be unreliable, it does not necessarily 
follow that the data are unuseful to researchers. This paper will contend that although data from a 
homogeneous sample might yield less reliable scores than did an inducted sample, these data 
should not be discarded until further examination of the data is conducted. The authors will 
present two statistics for monitoring data homogeneity and one statistic for correcting alpha 
when homogeneity is large. 
The Introduction of a Measure of Instrument Homogeneity for Interpreting Low Reliability
Coefficients

Much of the current literature concerning reliability emphatically suggests that
researchers obtain their own reliability estimates when gathering new data from a
previously-developed instrument and then report these estimates (Vacha-Haase, Kogan, & Thompson, in
press) and confidence intervals (Onwuegbuzie & Daniel, 2000) in their final reports. Moreover,
Thompson and Vacha-Haase (2000) encouraged researchers not to “induct” reliability estimates
for their own dataset from previous studies, but to obtain reliability estimates for their own
dataset because reliability estimates are affected by individual sample characteristics.
Concerning this practice of “inducting” reliability estimates, Pedhazur and Schmelkin (1991) 
stated: 

[Reliability estimates printed in test manuals] may be useful for comparative 
purposes, but it is imperative to recognize that the relevant reliability estimate is 
the one obtained for the sample used in the [current] study under consideration. 

(p. 86) 

Similarly, Dawis (1987) contended that although the type of instrument utilized may influence 
reliability, reliability also might be influenced by sample composition and sample variability. 
Therefore, it is imperative that researchers compute and interpret reliability coefficients for their 
underlying sample. 

Noting the problems with interpreting alpha as a consistent measure of reliability, 
Pedhazur and Schmelkin (1991) stated, “coefficient alpha will underestimate the reliability of a 
measure when its items are not at least essentially tau-equivalent” (p. 100). Pedhazur and
Schmelkin (1991) then gave a brief discussion of the problems associated with interpreting alpha
with differing test lengths (number of items), with restricting time allotted to administer the test, 
and when instrument homogeneity (unidimensionality) is high. The last of these problems, 
homogeneity, is the focus of the present essay. 

Coefficient Alpha 

For the purposes of this paper, we will use the formula for coefficient alpha that was 
originally developed by Cronbach (1951). For reference purposes, the formula for computing 
alpha is 



\[
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum \sigma_i^2}{\sigma_x^2}\right) \tag{1}
\]

where k is the number of items, $\sum \sigma_i^2$ is the sum of the individual test item variances, and
$\sigma_x^2$ is the total test variance. Although typically Kuder Richardson-20 is used with
dichotomously-scored items, it should be noted that alpha equals KR-20 because each kth $p_k q_k$
equals each ith $\sigma_i^2$, and across all k items $\sum p_k q_k = \sum \sigma_i^2$ (Thompson, 1999).
Consulting Equation 1, a large alpha would be yielded from scores that necessarily had a small sum
of individual item variances and a large total test variance. Likewise, a dataset yielding a small
alpha coefficient would be produced by scores that had high individual item variances and a small
total test variance. It should also be noted that although conceptually alpha represents a squared
metric, it is mathematically possible to obtain an alpha value less than zero. This occurs when the
sum of the individual item variances exceeds the total test variance.
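As a minimal sketch of Equation 1, the following Python function computes coefficient alpha from a matrix of item scores. The function name and the small demonstration datasets are hypothetical and are not taken from Tables 1 or 2; population variances (dividing by N) are assumed, in keeping with the note to Table 3.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Coefficient alpha for a list of examinee rows, each a list of k item scores.

    Uses population variances (divide by N), which is an assumption made here
    to match the note accompanying Table 3.
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha requires at least two items")
    # Transpose rows (examinees) into columns (items).
    items = list(zip(*scores))
    sum_item_var = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total test variance is zero; alpha is undefined")
    return (k / (k - 1)) * (1 - sum_item_var / total_var)


# Hypothetical data: four examinees, three dichotomous items.
demo = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha_demo = cronbach_alpha(demo)

# Hypothetical data in which the sum of the item variances exceeds the
# total test variance, so alpha falls below zero.
negative_demo = [
    [1, 0],
    [0, 1],
    [1, 0],
    [0, 0],
]
alpha_negative = cronbach_alpha(negative_demo)

print(alpha_demo, alpha_negative)
```

In the first hypothetical dataset the item variances sum to half of the total test variance, so alpha is (3/2)(1 - 0.5) = 0.75. In the second, the item variances sum to more than the total test variance, which illustrates how a negative alpha can arise.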

Consider the following heuristic examples of datasets that would yield a low reliability
estimate. The first example is a measure on which all of the students simply guessed at
responses to items and the variance of the individual items was as large as or even larger than the
total test variance. This might be the case if an instrument like the Scholastic Assessment Test
were given to second graders. The second example is when there is a lack of variability among
examinees. For example, if a second-grade spelling test were administered to college English
majors, all of the examinees would probably score close to the same value, thus yielding a small
alpha.

It has been recommended consistently in the literature that scores yielding low reliability 
estimates either be considered extremely suspect or discarded (Abelson, 1997). While in many 
instances, scores that yield low reliability coefficients indicate poor psychometric properties, it 
does not necessarily follow that the underlying data are always “unuseful” to researchers. 
Consider an example when a depression measure is administered to students who are all 
relatively not depressed, that is, who represent a homogeneous sample with respect to depression 
scores. Such scores likely would yield a low reliability estimate. However, interpreting this 
reliability coefficient without taking into consideration the homogeneous nature of the sample 
would be misleading. Rather, what is needed in such instances is not to discard the data, but to 
examine the scores further to determine why low reliability estimates were obtained from an 
instrument that produced high reliability estimates in the original test sample. In an attempt to 
overcome problems associated with low reliability estimates in homogeneous samples, 
Magnusson (1967) offered a formula based on the sample variance from the original normative 
group (i.e., the inducted sample) and the underlying sample group. Magnusson’s (1967) formula 
can be used to predict the reliability of scores from a present sample, based on the reliability of 
scores from the inducted sample and the standard deviations of both samples. 
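The formula itself is not reproduced in this paper. As a hedged illustration only, the sketch below uses a commonly cited textbook form of this kind of prediction, which assumes that the error variance (the squared standard error of measurement) is the same in both groups; we assume, without confirmation from the source, that this is the form being referenced. The function and parameter names are hypothetical.

```python
def predicted_reliability(rel_inducted, sd_inducted, sd_current):
    """Predict reliability in a current sample from an inducted sample.

    Assumed form (not confirmed by the source text):
        r_current = 1 - sd_inducted**2 * (1 - rel_inducted) / sd_current**2
    which follows from holding the error variance constant across groups.
    """
    if sd_inducted <= 0 or sd_current <= 0:
        raise ValueError("standard deviations must be positive")
    # Error variance implied by the inducted sample.
    error_var = sd_inducted ** 2 * (1 - rel_inducted)
    return 1 - error_var / sd_current ** 2


# Hypothetical values: reliability .90 in a normative group with SD = 10,
# applied to a more homogeneous sample with SD = 5.
r_pred = predicted_reliability(0.90, 10.0, 5.0)

# With equal standard deviations the prediction equals the inducted value.
r_same = predicted_reliability(0.90, 10.0, 10.0)

print(r_pred, r_same)
```

Under these hypothetical inputs the predicted reliability drops from .90 to about .60, which shows how a restricted score range alone can lower the expected coefficient.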
Heuristic Example

For the purposes of the present article, two heuristic datasets were utilized that were both 
(hypothetically) derived from the same eight-item instrument. The first dataset was designed 
such that the scores reflected what a homogeneous sample might look like if drawn from a 
population of scores. It should be noted that there are slight variations within each item because 
including data in which all of the examinees had the same score contributes nothing to the 
variability of the data. The second dataset was generated to represent a completely random set of 
scores. The data used are illustrated in Tables 1 and 2. (We would like to bring to the reader’s 
attention the fact that we are not advocating that such small samples be utilized, due to their 
ensuing low statistical power for detecting relationships. These small datasets are used for 
illustrative purposes only.) 



Insert Tables 1 and 2 about here 



Once data were generated, reliability analyses were conducted using the Statistical 
Package for the Social Sciences (SPSS; SPSS Inc., 2000). Results from the reliability analyses 
are provided in Table 3. The second column in Table 3 illustrates the variance and reliability 
estimates for the random data in Table 2. The third column represents the variance and 
reliability estimates for the homogeneous data in Table 1. In the fourth column, a hypothetical 
dataset also has been included to simulate plausible test manual data. We will use this dataset for 
comparative purposes with our other two random and homogeneous datasets. 



Insert Table 3 about here 
From Table 3, we see that the alpha coefficients from the random, homogeneous, and
inducted datasets are .097, .071, and .883, respectively. In this example, although the inducted
scores yielded a high reliability coefficient, scores from the two heuristic datasets yielded low
reliability estimates. Alpha for these two datasets is low because the sum of the
item variances is almost as great as the total test variance. In the inducted dataset, the sum of the
item variances is small, whereas the total test variance is large, thereby yielding a large alpha
coefficient.

The mean item variances in Table 3 are the overall mean of the eight item variances for 
each dataset. These values are .2408, .0588, and .1658 for the random, homogeneous, and 
inducted datasets, respectively. We see that the mean item variances effectively discriminate the 
random dataset from the homogeneous dataset. Specifically, whereas the mean item variance for 
the random dataset is larger than that for the inducted sample, the mean item variance for the 
homogeneous dataset is smaller than that for the inducted sample. 

The variance of the mean item variances (Table 3) is simply the squared standard 
deviation ($\sigma^2$) of the mean item variances. These values are .0001, .0000, and .0037 for the
random, homogeneous, and inducted datasets, respectively. It should be noted that a small 
variance of the mean item variances indicates that the mean item variances are relatively the 
same (e.g., either all small or all large). In each of the three datasets presented here, the variance 
of the mean item variances is small, thus indicating that nearly all of the item variances, within 
each dataset, are clustered relatively close together. Therefore, the variance of the mean item 
variances does not adequately discriminate the random and homogeneous samples. 
Also presented in Table 3 is the ratio of the sum of the individual test item variances to
the total test variance for the random, homogeneous, and inducted datasets. These values are
.9132, .9408, and .2211, respectively. This ratio, which represents the last term of Equation 1
above, is the most important component of an alpha coefficient. Interestingly, these ratios do not
adequately discriminate the random dataset from the homogeneous dataset.

Also presented in Table 3 is the square of the standard error of measurement. The 
standard error of measurement, or the standard deviation of errors of measurement, provides an 
absolute rather than a relative measure of the extent to which raw and true scores are equivalent 
(Crocker & Algina, 1986). Although the true score can never be known, the standard error of 
measurement can be applied to an individual’s observed score to set “plausible limits” for 
locating the true score. These “plausible limits” provide confidence bands for interpreting an 
obtained score. The smaller the standard error of measurement, the smaller the confidence band, 
and the greater the confidence that the observed score is near the true score. From Table 3, we 
see that the squared standard error of measurement effectively discriminates the random dataset 
from the homogeneous dataset. Specifically, whereas the squared standard error of measurement 
for the random dataset is larger than that for the inducted sample, the squared standard error of 
measurement for the homogeneous dataset is smaller than that for the inducted sample. 

Statistics for Detecting Sample Homogeneity 

Although Magnusson (1967) had previously defined methods for predicting alpha given 
psychometrics from original test manuals, not until recently have methods been developed for 
detecting sample homogeneity. Roberts and Onwuegbuzie (2000) have developed two statistics 
that investigate data homogeneity, given previous test psychometrics. The first of these statistics 
is computed as the difference in mean item variances. This statistic can be expressed as a 
percentage to yield what they termed the “relative mean item variance index.” The relative mean
item variance index is computed as

\[
\frac{M\sigma_I^2 - M\sigma_U^2}{M\sigma_I^2} \tag{2}
\]

where $M\sigma_I^2$ is the mean item variance for the inducted sample and $M\sigma_U^2$ is the mean item
variance for the underlying sample. This index ranges from $-\infty$ to 1, with positive values
indicating that the study sample is more homogeneous than the inducted sample, and negative 
values indicating that the study sample is less homogeneous than is the original sample. This 
index can be expressed as a percentage by multiplying the index by 100. For the present sample, 
the random group has a relative mean item variance index of -.4524, whereas the index 
pertaining to the homogeneous dataset is .6454. The negative relative index associated with the 
random dataset suggests that the low reliability index is not the result of homogeneity. 
Conversely, the relative index value pertaining to the homogeneous dataset indicates that the low 
reliability coefficient obtained for the homogeneous dataset is explained to some extent by the 
relative homogeneous nature of the dataset. In this latter case, it could be argued that the low 
reliability coefficient represents a statistical artifact. 
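The index values reported above can be checked with a short sketch. It assumes the index is the difference between the inducted and underlying mean item variances divided by the inducted mean item variance, a form that reproduces the reported values and the stated range; the function name is hypothetical, and the inputs are the mean item variances given in Table 3.

```python
def relative_mean_item_variance_index(mean_var_inducted, mean_var_underlying):
    """Relative mean item variance index.

    Assumed form: (inducted - underlying) / inducted, so that positive values
    indicate a study sample more homogeneous than the inducted sample.
    """
    if mean_var_inducted <= 0:
        raise ValueError("inducted mean item variance must be positive")
    return (mean_var_inducted - mean_var_underlying) / mean_var_inducted


# Mean item variances from Table 3 (inducted = .1658).
random_index = relative_mean_item_variance_index(0.1658, 0.2408)
homogeneous_index = relative_mean_item_variance_index(0.1658, 0.0588)

print(round(random_index, 4), round(homogeneous_index, 4))
```

Multiplying either result by 100 expresses the index as a percentage.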

The second statistic developed by Roberts and Onwuegbuzie (2000) was termed the
“relative squared standard error of estimate index,” and can be computed as

\[
\frac{s_I^2 - s_U^2}{s_I^2} \tag{3}
\]

where $s_I^2$ is the squared standard error of estimate from the inducted sample and $s_U^2$ is the squared
standard error of estimate from the underlying sample. This index also ranges from $-\infty$ to 1, with
positive values indicating that the study sample is more homogeneous than is the inducted
sample, and negative values indicating that the study sample is less homogeneous than is the
inducted sample. For the present heuristic example, the random group has a relative squared
standard error of estimate index of -1.7140, whereas the index pertaining to the homogeneous
dataset is .3382. As for the relative mean item variance index, the negative relative squared
standard error of estimate index associated with the random dataset suggests that the low
reliability index is not the result of homogeneity. Further discussion of these statistics is
presented in Roberts and Onwuegbuzie (2000).
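As a worked check of these two values, the arithmetic below uses the squared standard errors of measurement listed in Table 3 (2.0309 for the random dataset, 0.4952 for the homogeneous dataset, and 0.7483 for the inducted dataset). It assumes that the index is the inducted value minus the underlying value, divided by the inducted value.

```latex
\begin{align*}
\text{random:} \quad & \frac{0.7483 - 2.0309}{0.7483} = \frac{-1.2826}{0.7483} \approx -1.7140 \\
\text{homogeneous:} \quad & \frac{0.7483 - 0.4952}{0.7483} = \frac{0.2531}{0.7483} \approx 0.3382
\end{align*}
```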

Alpha ROE for Dichotomously Scored Instruments 

Although the relative mean item variance index and the relative squared standard error of 
estimate index are both useful statistics when investigating data obtained from a previously 
published instrument, it does not always stand to reason that test publishers will report item 
variances and/or standard error of the estimate for the inducted sample. Until now, there has 
been no way of correcting alpha in samples where homogeneity is large and access to a 
previously-normed sample is not available. 

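We define a new measure, which we will call $\alpha_{ROE}$. With N = the number of test
questions, $n_i$ being the number of respondents to test question i, and $a_i$, $b_i$ being the numbers
of students who answered question i correctly and incorrectly (respectively), we have

\[
\alpha_{ROE} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{a_i - b_i}{n_i}\right)^2 \tag{4}
\]

Since $a_i + b_i = n_i$ for each question, we can substitute to get

\[
\alpha_{ROE} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{2a_i}{n_i} - 1\right)^2 \tag{5}
\]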



In some sense, this is a variance measure, since

\[
\alpha_{ROE} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{a_i - n_i/2}{n_i/2}\right)^2 \tag{6}
\]

However, it is only natural to investigate what the range of this new $\alpha$ would be. Supposing that
everyone answers every question with a response of “1”, then $a_i = n_i \; \forall i$. Hence,

\[
\alpha_{ROE} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{2n_i}{n_i} - 1\right)^2 = \frac{1}{N}\sum_{i=1}^{N} 1 = 1 \tag{7}
\]

On the other hand, suppose that $a_i = b_i = n_i/2$ (i.e., each question has an equal number of
people answering a and b). Then,

\[
\alpha_{ROE} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{2(n_i/2)}{n_i} - 1\right)^2 = \frac{1}{N}\sum_{i=1}^{N} 0 = 0 \tag{8}
\]



So we see that for a dataset with perfect homogeneity, $\alpha_{ROE}$ receives a value of “1”, whereas
with a perfectly heterogeneous dataset, $\alpha_{ROE}$ receives a value of “0”. Thus we have the bounds
of $\alpha_{ROE}$. Therefore, in any given dataset, $\alpha_{ROE}$ is measuring the amount of variance from
complete data heterogeneity, since $a_i = n_i/2$ in a dataset of complete heterogeneity. Thus,
intuitively, we are measuring how far the obtained data are from complete dataset heterogeneity.

Applying this formula to the current “homogeneous” dataset, we can see that for each of
the eight items, 1 person out of 16 has a score differing from the others (note: this was set up to
provide some variance in the dataset so that Cronbach’s alpha could be computed). Thus, for each
item the numbers of correct and incorrect responses differ by 14 out of 16 respondents, so the
squared term for each item is $(14/16)^2 \approx 0.766$ across all eight items. Therefore
$\alpha_{ROE}$ equals 0.766.
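The computation for the homogeneous dataset can be sketched in a few lines of Python. The sketch assumes that the statistic is the mean, across items, of the squared quantity (2a/n - 1), where a is the number of correct responses and n the number of respondents for an item; this form reproduces the value of 0.766 and the bounds of 0 and 1 described above, but the exact typeset form of the original equations is an assumption. The function name is hypothetical, and the counts assume that 15 of 16 respondents answered each item correctly (1 of 16 would give the same result).

```python
def alpha_roe(correct_counts, respondent_counts):
    """Mean over items of (2 * a_i / n_i - 1) ** 2 (assumed form of the statistic)."""
    if not correct_counts or len(correct_counts) != len(respondent_counts):
        raise ValueError("need one respondent count per item")
    terms = []
    for a_i, n_i in zip(correct_counts, respondent_counts):
        if n_i <= 0 or not 0 <= a_i <= n_i:
            raise ValueError("counts must satisfy 0 <= a_i <= n_i with n_i > 0")
        # Squared distance from an even split, scaled to lie in [0, 1].
        terms.append((2 * a_i / n_i - 1) ** 2)
    return sum(terms) / len(terms)


# Eight items, 16 respondents each, 15 answering alike on every item.
homogeneous = alpha_roe([15] * 8, [16] * 8)

# Boundary cases: everyone answers alike, and an even split on every item.
all_alike = alpha_roe([16] * 8, [16] * 8)
even_split = alpha_roe([8] * 8, [16] * 8)

print(round(homogeneous, 3), all_alike, even_split)
```

The two boundary cases correspond to the perfectly homogeneous and perfectly heterogeneous datasets discussed in the text.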
Likewise, we can compute the $\alpha_{ROE}$ for the “random” dataset by applying the above
formula. In doing so we have

\[
\alpha_{ROE} = \frac{1}{8}\sum_{i=1}^{8}\left(\frac{a_i - b_i}{16}\right)^2 \tag{9}
\]

or

\[
\alpha_{ROE} = \frac{.016 + .016 + 0 + .141 + .016 + .016 + .016 + .063}{8} \tag{10}
\]

such that $\alpha_{ROE}$ is .036 for the “random” dataset.

When investigating datasets with low reliability, computing $\alpha_{ROE}$ seems to be the best
and easiest means of investigating attenuation in reliability estimates due to data homogeneity.
While the relative mean item variance index and the relative squared standard error of the
estimate index help determine the relative homogeneity based on the original inducted sample,
$\alpha_{ROE}$ seems to be a more appropriate statistic for use when the test manual psychometrics are not
available (or complete) and for use when no other reference dataset is available.

Because $\alpha_{ROE}$ is a squared metric, it has a lower bound of zero. And since the total
number correct for any given item cannot exceed the number of respondents, the upper bound for
this statistic is one. Consequently, $\alpha_{ROE}$ can be interpreted in the same
manner as Cronbach’s alpha. For example, in the case where coefficient alpha is .80, we
could state that at least 80% of the total test score variance is due to true score variance. In the
case of $\alpha_{ROE}$, if $\alpha_{ROE}$ = .80, we could say that at least 80% of the lack of total test score variance
(and individual item variance) is due to examinee homogeneity.
Real-Life Examples Illustrating When Homogeneity Yields Meaningful and Worthless Low
Reliability Coefficients

One of the main concerns posited in this paper is that researchers, when confronted with 
data that yield a low alpha coefficient, should seek to investigate the reason(s) for this low 
reliability estimate. We believe that in the case where the data are homogeneous, it does not 
follow that simply discarding the data is merited. Consider the two following examples of data 
that would yield low reliability estimates. The first example we will refer to as an illustration of 
“bad” homogeneity, and the second example we will refer to as an illustration of “good” 
homogeneity. 

After administering an examination, a researcher discovers that the dataset she is 
investigating had low reliability because an achievement test had been administered to a group of 
sixth graders that was originally intended for second graders. As a result, the entire cohort of 
students achieved high scores on the examination. In this instance, it could be justifiably 
contended that this instrument yields unreliable data for this specific sample. As has been 
mentioned previously, this does not mean that the instrument is unreliable (instruments are 
neither reliable nor unreliable), but simply that the instrument is inappropriate for use with this 
cohort. This is an example of what is meant by “bad” homogeneity, or homogeneity that should 
consequentially lead to a discarding of the data. 

The second example represents a different but equally realistic scenario. Suppose that a 
depression scale has been administered to 45 patients of an outpatient clinic for depression. 
Researchers have administered this instrument because they are concerned with the effectiveness 
of a certain intervention and are measuring the gain scores on the depression scale. In this case, 
it would be expected that the reliability estimates would be very small for the 45 patients who are
relatively homogeneous because they are in a depression clinic being treated for depression.
However, simply disposing of the data in this instance seems unwarranted if indeed low
reliability is due to homogeneity and not to randomness. The goal of administering the 
instrument to these patients is to monitor the effectiveness of the intervention. In research 
studies like this, the effect size will be maximized if and only if participants from one extreme of 
the scale (e.g., clinically-depressed participants) are the focus of the study. Otherwise, ceiling 
effects could confound results, thereby providing rival explanations (i.e., low internal validity). 
Additionally, findings from such an investigation would only be generalizable (i.e., have 
maximal external validity) if a clinically-depressed sample is utilized. This is an example of 
“good” homogeneity, or homogeneity that attenuates the reliability estimate in a manner that is 
directly a function of the level of homogeneity of the sample. 

In the second example, it is extremely likely that the mean item variances were smaller
than the mean item variances reported in the test manual. It is also very likely that $\alpha_{ROE}$ is very
large since all examinees scored roughly the same on most of the administered items. If this
were the case, the data should still be used in the analysis, particularly if the sample size was
large, because the low reliability estimate is due to individual homogeneity and thus appears
acceptable considering the context of the study. Moreover, with respect to the homogeneous
dataset presented in Table 1, it should be noted that if only the reliability coefficient in Table 3
had been reported, and further examination of the reliability properties had not been performed,
readers of the subsequent final report might not have looked favorably on scores that yielded a
reliability coefficient of .071. However, adding an explanation of the possible reasons for
a small coefficient alpha, such as an $\alpha_{ROE}$ of .766, would strengthen the researchers’ argument
for using the existing data for further analyses.







Conclusion 

Although the major focus of this paper has been to advocate that researchers spend time 
examining the reasons behind data yielding low reliability estimates, it should be noted that, 
previously, no correction statistics existed for alpha coefficients with homogeneous samples. 

This can and should be the focus of future research in this area. While we encourage the 
investigation of examinee homogeneity through $\alpha_{ROE}$, the mean item variances, and the squared
standard error of measurement, no specific guidelines have been developed for determining what
is and what is not an acceptable threshold for data homogeneity. However, as illustrated above,
both $\alpha_{ROE}$ and the (squared) standard error of measurement provide extremely useful information
about the degree of homogeneity of an underlying sample. 

Because low reliability estimates reduce the statistical power associated with hypothesis 
tests (Onwuegbuzie & Daniel, 2000), researchers should utilize larger samples, when they expect 
that their sample is homogeneous, in order to compensate for the corresponding reliability-based 
loss in statistical power. In fact, increasing the sample size (i.e., a research-based consideration) 
would probably provide a better correction for attenuated relationships than any statistical 
correction of the reliability coefficient itself (i.e., a statistical consideration), just as randomizing 
participants to treatment conditions in an experimental design is superior to using analysis of 
covariance techniques or other statistical adjustments to analyze data stemming from
non-experimental research designs.

Nevertheless, encouraging the investigation of coefficients of reliability can only help the 
growth of accountability in test usage. It is our contention that researchers should be encouraged 
not only to compute reliability estimates for their own data, but also to investigate and to explain 
why the reliability estimates differ from the original test manual norming (i.e., inducted) sample.
Such information would allow readers to put subsequent findings in a more appropriate context. 
Table 2

Random dataset 
Table 3

Item and Test Characteristics for the Two Heuristic Datasets and the “Inducted” Dataset

Variables                                            Random     Homogeneous   “Inducted”
                                                     Dataset    Dataset       Dataset
----------------------------------------------------------------------------------------
Item 1 variance                                       0.2461     0.0588        0.1094
Item 2 variance                                       0.2461     0.0588        0.1875
Item 3 variance                                       0.2500     0.0588        0.2125
Item 4 variance                                       0.2113     0.0588        0.2148
Item 5 variance                                       0.2461     0.0588        0.2461
Item 6 variance                                       0.2461     0.0588        0.1875
Item 7 variance                                       0.2461     0.0588        0.0588
Item 8 variance                                       0.2644     0.0588        0.1094
Σ item variances                                      1.9262     0.4704        1.3260
Mean item variance                                    0.2408     0.0588        0.1658
Variance of the mean item variances                   0.0001     0.0000        0.0037
Total test variance                                   2.1094     0.5000        5.9963
Σ item variances/total test variance                  0.9132     0.9408        0.2211
Alpha coefficient                                     0.0974     0.0714        0.8830
(Standard error of measurement)²                      2.0309     0.4952        0.7483
Relative mean item variance index                    -0.4524     0.6454
Relative squared standard error of estimate index    -1.7141     0.3382

Note. All variance components were computed with the population size (N) and not the
corrected sample size (N - 1).